-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-8082 cloud_io: add missing error handling to remote::download_object
#24059
CORE-8082 cloud_io: add missing error handling to remote::download_object
#24059
Conversation
This is to allow passing in a `retry_chain_logger` which does not inherit from `ss::logger` but wraps it.
The call to `drain_response_stream` may throw various transport related errors (see one example below of a Broken Pipe error observed in CI). These errors should be handled inside the `remote::download_object` method because the caller's expectation is that download-related errors are communicated via the `download_result` return type rather than through an exception. Some of these errors (like the broken pipe error below) could also be retried, whereas with the previous implementation they were not retried. These exceptions are often ignored by the caller and may be printed as "Exceptional future ignored" log lines, which cause CI failures and are less useful for debugging. The below is an example of one such ignored exceptional future in the remote partition finalizing background fibre: ``` INFO 2024-10-29 12:41:17,708 [shard 1:main] cloud_storage - [fiber474 kafka/fuzzy-operator-6356-dzxvff/4] - remote_partition.cc:1406 - Finalizing remote storage state... DEBUG 2024-10-29 12:41:17,723 [shard 1:main] cloud_io - [fiber819~0|1|19984ms] - remote.cc:430 - Receive OK response from "37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin" WARN 2024-10-29 12:41:17,723 [shard 1:main] http - /37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin - client.cc:414 - receive error std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe) WARN 2024-10-29 12:41:17,723 [shard 1:main] seastar - Exceptional future ignored: std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe), backtrace: 0xa73be23 0xa392e05 0x360a6b8 0x9352157 0x360a71a 0xa48cc6f 0xa49045c 0xa4e77ca 0xa402f3f /opt/redpanda/lib/libc.so.6+0x961b6 /opt/redpanda/lib/libc.so.6+0x11839b ```
0825419
to
ad14537
Compare
force-push: format the bazel build file to fix the linting error |
the below tests from https://buildkite.com/redpanda/redpanda/builds/57781#01930722-dd26-462e-b3c6-242f5ac00b6e have failed and will be retried
|
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/57781#0193077d-fcd8-4922-9958-3b8ff09dc64d:
|
Retry command for Build#57781please wait until all jobs are finished before running the slash command
|
@@ -53,6 +53,7 @@ redpanda_cc_library( | |||
"//src/v/utils:functional", | |||
"//src/v/utils:log_hist", | |||
"//src/v/utils:named_type", | |||
"//src/v/utils:retry_chain_node", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably doesn't have material impact here but as rule of thumb this should go to implementation_deps https://redpandadata.atlassian.net/wiki/spaces/CORE/pages/714670307/Bazel
/backport v24.2.x |
/backport v24.1.x |
Failed to create a backport PR to v24.2.x branch. I tried:
|
Failed to create a backport PR to v24.1.x branch. I tried:
|
The call to
drain_response_stream
may throw various transport related errors (see one example below of a broken pipe error observed in CI). These errors should be handled inside theremote::download_object
method because the caller's expectation is that download-related errors are communicated via thedownload_result
return type rather than through an exception. Some of these errors (like the broken pipe error below) could also be retried, whereas with the previous implementation, they were not retried.These exceptions are often ignored by the caller and may be printed as "Exceptional future ignored" log lines, which cause CI failures and are less useful for debugging.
The below is an example of one such ignored exceptional future in the remote partition finalizing background fibre:
This issue was highlighted by failures of the
RandomNodeOperations
test with tiered storage enabled running in CDT, which are linked below.Fixes https://redpandadata.atlassian.net/browse/CORE-8082
Fixes https://redpandadata.atlassian.net/browse/CORE-8052
Fixes https://redpandadata.atlassian.net/browse/CORE-8018
Fixes https://redpandadata.atlassian.net/browse/CORE-8009
Fixes https://redpandadata.atlassian.net/browse/CORE-7979
Fixes https://redpandadata.atlassian.net/browse/CORE-7834
Fixes https://redpandadata.atlassian.net/browse/CORE-7767
Fixes https://redpandadata.atlassian.net/browse/CORE-7675
Fixes https://redpandadata.atlassian.net/browse/CORE-7315
Backports Required
Release Notes
Bug Fixes